Complete Java Execution Pipeline
The journey from a .java source file to CPU instructions is a multi-stage pipeline involving lexical analysis, bytecode generation, class loading, bytecode verification, linking, JIT compilation, machine code generation, and finally CPU execution. Understanding each stage is fundamental to diagnosing performance problems, memory issues, and subtle concurrency bugs in production systems.
Full Pipeline Overview
Stage 1 — javac Compilation
The javac compiler performs: lexical analysis (tokenization), syntax analysis (parse tree), semantic analysis (type checking, name resolution), constant folding, and bytecode generation. It outputs a .class file containing the JVM bytecode — a platform-independent intermediate representation.
The .class file contains: a magic number (0xCAFEBABE), major/minor version, constant pool, access flags, class/interface hierarchy, field descriptors, method descriptors, bytecode for each method, and attribute tables (SourceFile, LineNumberTable, LocalVariableTable, etc.).
```java
// Java source
public class Add {
    public static int add(int a, int b) {
        return a + b;
    }
}
```

```
// Compiled bytecode (javap -c Add.class)
public static int add(int, int);
  Code:
     0: iload_0    // push a onto operand stack
     1: iload_1    // push b onto operand stack
     2: iadd       // pop 2, add, push result
     3: ireturn    // return int on top of stack
```
Stage 2 — Class Loading
The ClassLoader subsystem reads .class bytes into the JVM's Method Area (Metaspace). The three built-in loaders form a strict delegation hierarchy. Loading is lazy by default — a class is not loaded until it is first actively used.
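Lazy loading is easy to observe directly. The sketch below (class names are illustrative, not JDK API) tracks whether a class's `<clinit>` has run: merely declaring a variable of the type does nothing, while the first read of a static field pulls the class through loading, linking, and initialization.

```java
// Records whether Lazy's <clinit> has run, without relying on stdout ordering.
class InitTracker {
    static boolean lazyInitialized = false;
}

class Lazy {
    static { InitTracker.lazyInitialized = true; }
    static int value = 42;
}

public class LazyLoadDemo {
    public static void main(String[] args) {
        Lazy unused = null;                          // passive: a type reference alone does not initialize
        assert !InitTracker.lazyInitialized;
        int v = Lazy.value;                          // first active use → loading + linking + <clinit>
        assert InitTracker.lazyInitialized && v == 42;
    }
}
```

Run with `java -ea LazyLoadDemo` — plain `assert` statements are disabled by default.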
Stage 3 — Bytecode Verification
The bytecode verifier enforces JVM type safety. It checks: that the stack never overflows or underflows, that local variable types are consistent, that field and method references are valid, and that final methods are not overridden. This is a critical security boundary — it ensures that even malicious bytecode cannot subvert JVM memory safety.
Stage 4 — Linking (Prepare + Resolve)
Preparation allocates memory for static fields and sets them to default values (0, null, false). Resolution replaces symbolic references in the constant pool with direct references (memory pointers) to classes, fields, and methods. Resolution can be eager or lazy depending on JVM flags.
Stage 5 — Initialization
The class's <clinit> method is invoked — executing static field assignments and static initializer blocks in textual order. Initialization is synchronized: the JVM guarantees that only one thread initializes a class, and subsequent threads see the initialized state without synchronization (via the class loading lock).
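This per-class initialization lock is exactly what makes the initialization-on-demand holder idiom work. A minimal sketch (class names illustrative): the nested holder class is not initialized when the outer class is, and the JVM's `<clinit>` guarantees make the lazy publication thread-safe without `volatile` or `synchronized`.

```java
public class Config {
    private Config() {}

    // Holder is a separate class, so it is NOT initialized when Config is.
    private static class Holder {
        static final Config INSTANCE = new Config();  // runs inside Holder's <clinit>
    }

    // First call triggers Holder's initialization under the JVM's per-class init lock;
    // all later calls see the fully constructed INSTANCE with zero synchronization cost.
    public static Config getInstance() {
        return Holder.INSTANCE;
    }
}
```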
Stage 6 — Interpreter Execution (Tier 0)
The template interpreter dispatches each bytecode instruction via a pre-generated native code fragment (a "template"). Each bytecode has its own assembly stub. The interpreter maintains a per-method invocation counter. When a method's counter exceeds CompileThreshold (default: 10,000), it is submitted to the JIT compiler queue.
Stage 7 — Tiered JIT Compilation
HotSpot uses tiered compilation: Tier 0 = interpreter, Tier 1 = C1 (no profiling), Tier 2 = C1 (limited profiling), Tier 3 = C1 (full profiling), Tier 4 = C2 (maximum optimization). Methods graduate through tiers based on invocation and back-edge (loop) counters.
Stage 8 — Machine Code Execution
JIT-compiled code is stored in the Code Cache (a JVM-managed native memory region). Execution jumps into this cache, bypassing the interpreter entirely. The JVM may deoptimize compiled code back to the interpreter if speculative optimizations (like inlining based on CHA) are invalidated by class loading events.
JVM Architecture — System Level
HotSpot is a complex C++ application (~4 million lines of code in OpenJDK). Understanding its subsystem boundaries and their interactions is essential for advanced JVM tuning and modification.
ClassLoader Subsystem
Responsible for finding, loading, verifying, and initializing class files. Interacts directly with the file system, JAR files, module descriptors, and the network (for remote class loading). The result of loading is stored in Metaspace as a Klass C++ object, paired with a heap-allocated java.lang.Class mirror object.
Runtime Data Areas
- Metaspace (native memory): class metadata, method bytecode, constant pools, vtables, itables.
- Heap (GC-managed): all Java objects and arrays.
- Thread Stacks: per-thread; contain a frame for each method call.
- PC Register: points to the current bytecode instruction (undefined for native methods).
- Native Method Stack: the C stack used when executing JNI native methods.
Execution Engine
The execution engine has three major components that cooperate continuously: the template interpreter (executes bytecode until hot), the JIT compiler (compiles hot methods to native code), and the garbage collector (reclaims dead objects while respecting safepoints and GC barriers emitted by both the interpreter and JIT).
Native Interface (JNI)
JNI bridges Java and native C/C++ code. The JVM pushes/pops JNI frames on the thread's native call stack. JNI handles require explicit lifecycle management — GC roots are pinned while in native frames to prevent object relocation by concurrent collectors.
HotSpot Source Code Structure
Navigating the OpenJDK HotSpot source tree is daunting initially. Understanding the layout enables targeted exploration and modification.
Key Source Files for JVM Internals
| File | Responsibility | Key Concepts |
|---|---|---|
| `oops/oop.hpp` | Base class for all Java objects in the JVM | Mark word, klass pointer, field layout |
| `oops/klass.hpp` | Class metadata representation | vtable, itable, layout helper |
| `oops/markWord.hpp` | Object header mark word layout | Lock state, hash code, GC age |
| `runtime/safepoint.cpp` | Safepoint polling and synchronization | Thread suspension, deoptimization |
| `interpreter/templateTable.cpp` | Per-bytecode native code templates | Bytecode dispatch, stack manipulation |
| `gc/g1/g1CollectedHeap.cpp` | G1 GC heap management | Region management, concurrent marking |
| `compiler/compileBroker.cpp` | JIT compilation task management | Compilation queue, tier transitions |
OOP Hierarchy
In HotSpot, "oop" means Ordinary Object Pointer — a pointer to a Java object on the heap. The C++ type hierarchy is:
```
oop                      // base pointer type
├─ instanceOop           // regular Java object instance
└─ arrayOop              // base for arrays
   ├─ objArrayOop        // array of object references
   └─ typeArrayOop       // array of primitives (int[], byte[], etc.)

Klass                    // class metadata (in Metaspace)
├─ InstanceKlass         // regular class
└─ ArrayKlass            // array class
   ├─ ObjArrayKlass      // reference array
   └─ TypeArrayKlass     // primitive array
```
Class Loading Internals
Class loading is a three-phase process: Loading (finding and reading the class file bytes), Linking (verification + preparation + resolution), and Initialization (executing <clinit>). Each phase has strict rules and ordering guarantees.
ClassLoader Hierarchy
Parent Delegation Model
When a ClassLoader receives a loadClass(name) request, it always delegates to its parent first. Only if the parent cannot find the class does the child attempt to load it. This ensures that core Java classes (e.g., java.lang.String) are always loaded by Bootstrap, preventing malicious replacement of system classes.
```java
// ClassLoader.loadClass() — simplified logic
protected synchronized Class<?> loadClass(String name, boolean resolve)
        throws ClassNotFoundException {
    Class<?> c = findLoadedClass(name);              // check cache
    if (c == null) {
        try {
            if (parent != null)
                c = parent.loadClass(name, false);   // delegate up
            else
                c = findBootstrapClass(name);        // bootstrap
        } catch (ClassNotFoundException e) {
            c = findClass(name);                     // child handles it
        }
    }
    if (resolve) resolveClass(c);
    return c;
}
```
Class Identity
In the JVM, a class is uniquely identified by the tuple (fully-qualified-name, ClassLoader-instance). Two classes with the same name loaded by different ClassLoaders are completely different types — instances of one cannot be cast to the other, even if they have identical bytecode. This is the source of the notorious ClassCastException in application servers.
A leaked ClassLoader keeps every class it defined alive, which is the classic cause of OutOfMemoryError: Metaspace in hot-deploy scenarios.
Custom ClassLoader Implementation
```java
public class BytecodeClassLoader extends ClassLoader {
    private final byte[] bytecode;

    public BytecodeClassLoader(byte[] bytecode) {
        super(ClassLoader.getSystemClassLoader());
        this.bytecode = bytecode;
    }

    @Override
    protected Class<?> findClass(String name) {
        // defineClass hands the bytes to the JVM, which links them (verify + prepare + resolve)
        return defineClass(name, bytecode, 0, bytecode.length);
    }
}
```
Dynamic Module Loading (JPMS)
In Java 9+, the module system layers module visibility rules on top of ClassLoaders. Code in one module can only access a class in another module if that module exports the containing package. The Bootstrap CL is now responsible for java.base, while the Platform CL covers the other JDK modules. Named modules provide stronger encapsulation guarantees than the old classpath model.
Linking — Deep Internals
| Phase | What Happens | Errors Thrown |
|---|---|---|
| Verification | Bytecode format checks, type safety validation, stack shape analysis, instruction reachability | VerifyError |
| Preparation | Static field memory allocated; set to zero-values. No code executed yet. | OOM for Metaspace |
| Resolution | Symbolic refs in constant pool replaced with direct refs (pointers). Lazy or eager per JVM. | NoClassDefFoundError, LinkageError |
Class Initialization Order — Exact Execution Rules
Class initialization ordering is one of the most-tested JVM topics in FAANG interviews. The JVM specification defines a precise ordering that must be understood at the bytecode level.
Complete Ordering Rules
Initialization Triggers (Active Uses)
A class is initialized only when it is actively used. Passive uses (e.g., accessing a static final compile-time constant, creating an array of the type) do not trigger initialization.
| Trigger | Active Use? | Notes |
|---|---|---|
| `new ClassName()` | ✅ Yes | Most common trigger |
| Call a static method | ✅ Yes | Inherited static methods initialize the declaring class |
| Access/assign a static field | ✅ Yes | Unless it's a static final compile-time constant |
| `Class.forName("Foo")` | ✅ Yes | By default; pass `initialize=false` to suppress |
| First active use of a subclass | ✅ Yes | Triggers superclass initialization first |
| Access `static final int X = 5` | ❌ No | Compile-time constant — inlined by javac |
| `new Foo[10]` | ❌ No | Array creation doesn't initialize the element type |
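The passive-use rows can be verified with a small experiment (class names illustrative). `COMPILE_TIME` is a constant expression, so javac inlines it at the use site and no `getstatic` on `Constants` is ever executed; `RUNTIME` is initialized by a method call, so reading it is a genuine active use.

```java
class Flags {
    static boolean constantsInitialized = false;
}

class Constants {
    static { Flags.constantsInitialized = true; }
    static final int COMPILE_TIME = 5;         // constant expression → inlined by javac
    static final int RUNTIME = "x".length();   // method call → NOT a compile-time constant
}

public class TriggerDemo {
    public static void main(String[] args) {
        int a = Constants.COMPILE_TIME;        // passive: javac inlined the literal 5
        Constants[] arr = new Constants[10];   // passive: loads the array class, no Constants init
        assert !Flags.constantsInitialized;    // <clinit> has NOT run
        int b = Constants.RUNTIME;             // active: a real getstatic
        assert Flags.constantsInitialized;     // now it has
    }
}
```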
Tricky Interview Case — Static Initialization Ordering
```java
class Parent {
    static int x = initX();                                          // Step A
    static { System.out.println("Parent static block, x=" + x); }    // Step B
    static int initX() { return 10; }
}

class Child extends Parent {
    static int y = 20;                                               // Step D (after Parent init)
    static { System.out.println("Child static block, y=" + y); }     // Step E
    int instanceY = 100;                                             // Step G (per instance)
    { System.out.println("Child IIB, instanceY=" + instanceY); }     // Step H
    Child() { System.out.println("Child constructor"); }             // Step I
}

// new Child() output:
// Parent static block, x=10    ← Parent <clinit> runs first
// Child static block, y=20     ← Child <clinit> runs second
// Child IIB, instanceY=100     ← instance initializer before constructor body
// Child constructor            ← constructor body last
```
The Forward Reference Trap
```java
class ForwardRef {
    // A simple-name forward reference (b + 1) is rejected by javac, but a
    // qualified reference compiles — and reads b's preparation default, 0.
    static int a = ForwardRef.b + 1;
    static int b = 5;

    public static void main(String[] args) {
        System.out.println(a);   // prints 1, NOT 6!
        System.out.println(b);   // prints 5
    }
}
```
The assignment to `a` executes while `b` still holds its preparation default of 0, so `a` becomes 1, not 6.
Initialization Deadlock
The JVM uses a per-class initialization lock. If two threads race to initialize classes A and B, where A's <clinit> references B and B's <clinit> references A, a circular initialization deadlock occurs. The JVM specification (§5.5 step 3) documents this as a potential deadlock — the JVM does not detect or prevent it.
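The single-threaded version of the same cycle does not deadlock — the JVM lets a thread re-enter a class it is already initializing — but it silently exposes default values. A runnable sketch (class names illustrative); run the two initializations from two racing threads instead and this same cycle can deadlock.

```java
class CycleA {
    static int value = CycleB.value + 1;  // triggers CycleB's <clinit>
}

class CycleB {
    // CycleA is "in progress" by this same thread, so the JVM does NOT block here;
    // it reads CycleA.value while it still holds its preparation default, 0.
    static int value = CycleA.value + 1;
}

public class CircularInit {
    public static void main(String[] args) {
        // Touching CycleA first: CycleB.value = 0 + 1 = 1, then CycleA.value = 1 + 1 = 2
        System.out.println(CycleA.value + " " + CycleB.value);  // prints: 2 1
    }
}
```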
JVM Bytecode Engine — Stack-Based Execution
The JVM is a stack-based virtual machine. Unlike register-based VMs (like Dalvik/ART), all bytecode instructions operate on an operand stack within a stack frame. Understanding this model is essential for reasoning about bytecode correctness and JIT optimization opportunities.
Stack Frame Structure
Bytecode Example — Arithmetic
```java
// Java source
public static int compute(int a, int b) {
    int c = a + b;
    return c * 2;
}
```

```
// Bytecode (javap -c)
0: iload_0    // push local[0] (a)   → stack: [a]
1: iload_1    // push local[1] (b)   → stack: [a, b]
2: iadd       // pop 2, add, push    → stack: [a+b]
3: istore_2   // pop → local[2] (c)  → stack: []
4: iload_2    // push local[2] (c)   → stack: [c]
5: iconst_2   // push constant 2     → stack: [c, 2]
6: imul       // pop 2, multiply     → stack: [c*2]
7: ireturn    // return int on TOS
```
Complete Bytecode Instruction Reference
| Category | Instructions | Description |
|---|---|---|
| Load (stack push) | iload, lload, fload, dload, aload | Push local variable onto stack |
| Store (stack pop) | istore, lstore, fstore, dstore, astore | Pop stack top to local variable |
| Constants | iconst_0..5, lconst, fconst, dconst, aconst_null, bipush, sipush, ldc | Push literal/constant pool values |
| Arithmetic | iadd, isub, imul, idiv, irem, ineg, ishl, ishr, iushr, iand, ior, ixor | Integer arithmetic/bitwise |
| Long/Float/Double | ladd, fadd, dadd, ... | 64-bit and floating-point ops |
| Conversion | i2l, i2f, i2d, l2i, f2i, d2i, i2b, i2c, i2s | Primitive type widening/narrowing |
| Object creation | new, newarray, anewarray, multianewarray | Heap allocation |
| Field access | getfield, putfield, getstatic, putstatic | Instance and static field I/O |
| Method invocation | invokevirtual, invokeinterface, invokestatic, invokespecial, invokedynamic | Method dispatch |
| Control flow | if_icmpeq, if_icmplt, goto, tableswitch, lookupswitch | Branches and jumps |
| Return | ireturn, lreturn, freturn, dreturn, areturn, return | Method return (plain return for void) |
| Stack ops | pop, pop2, dup, dup2, swap | Stack manipulation |
| Monitors | monitorenter, monitorexit | Synchronized block enter/exit |
Method Invocation Types — Critical Distinction
| Instruction | Use Case | Dispatch Mechanism |
|---|---|---|
| `invokevirtual` | Regular instance method call | vtable lookup (polymorphic) |
| `invokeinterface` | Interface method call | itable lookup (slower than vtable) |
| `invokestatic` | Static method call | Direct (no dispatch needed) |
| `invokespecial` | `super()`, `this()`, private methods, `<init>` | Direct reference, no vtable |
| `invokedynamic` | Lambdas, method handles, Groovy/Kotlin | Bootstrap method + call site |
Template Interpreter — How Bytecode Dispatches
HotSpot's template interpreter does NOT use a switch/case bytecode loop. Instead, at JVM startup, it pre-generates assembly stubs for every bytecode. The dispatch table is a fixed array of code pointers. At the end of each bytecode stub, the interpreter reads the next opcode byte and jumps directly to the next stub — this is called threaded dispatch.
```
// Conceptual dispatcher (the real code is assembly generated by templateTable.cpp)
// After each instruction, dispatch to the next:
//   1. Read byte at PC → opcode
//   2. Increment PC
//   3. Jump to dispatch_table[opcode]
// Giving every bytecode its own dispatch branch predicts far better than one
// central switch, and lets the CPU pipeline bytecode execution.
```
Runtime Data Areas — Deep Internals
Complete Memory Map
Metaspace Deep Dive
Metaspace replaced PermGen in Java 8. It lives in native memory (outside the Java heap) and is managed by HotSpot's own chunk-based allocator on address space reserved from the OS via mmap. Key structures stored in Metaspace:
- InstanceKlass: class metadata, field descriptors, method descriptors
- ConstantPool: the class's runtime constant pool (symbolic + resolved references)
- Method: method metadata, bytecode array, exception table
- vtable/itable: virtual dispatch tables for polymorphic method calls
- Annotations, generic signatures: reflection metadata
Since Java 8, static fields live in the java.lang.Class mirror object on the heap, not in Metaspace. This means static fields ARE subject to GC (the Class object must remain reachable).
Thread Stack — Frame Layout
Each method invocation creates a stack frame. The JVM spec defines the frame's logical contents: local variable table, operand stack, frame data. In HotSpot's template interpreter, additional hidden slots are added for interpreter state (method pointer, bytecode pointer, last SP saved for deoptimization).
| Memory Area | Thread-local? | GC-managed? | Overflow Error |
|---|---|---|---|
| Heap | ❌ Shared | ✅ Yes | OutOfMemoryError |
| Metaspace | ❌ Shared | ⚠️ On CL GC | OutOfMemoryError: Metaspace |
| Thread Stack | ✅ Per-thread | ❌ No | StackOverflowError |
| PC Register | ✅ Per-thread | ❌ No | N/A |
| Native Stack | ✅ Per-thread | ❌ No | Native stack overflow (SIGSEGV) |
| Code Cache | ❌ Shared | ❌ No | "CodeCache is full" warning; JIT disabled |
HotSpot Object Memory Layout
Every Java object on the heap has a precise binary layout defined by HotSpot. Understanding this layout is essential for memory analysis, GC tuning, and off-heap programming with tools like Unsafe or JEP 454 (Foreign Memory API).
Object Layout Diagram
Mark Word States
| State | Bits Layout | When Active |
|---|---|---|
| Unlocked | [unused (25)][identity hash (31)][unused (1)][age (4)][0][01] | Normal unlocked object (hash bits zero until first requested) |
| Biased | [thread ID (54)][epoch (2)][unused (1)][age (4)][1][01] | Lock biased toward a thread (disabled by default in Java 15, removed in Java 18) |
| Lightweight | [ptr to lock record (62)][00] | Thread holds lock, uncontended |
| Heavyweight | [ptr to ObjectMonitor (62)][10] | Inflated monitor, contended lock |
| GC mark | [forwarding ptr (62)][11] | Object being moved by GC |
Compressed OOPs
On 64-bit JVMs with heap < 32GB, -XX:+UseCompressedOops (default on) stores object references as 32-bit values. The JVM transparently scales these compressed pointers by a factor of 8 (since all objects are 8-byte aligned), effectively addressing 32GB with 32 bits. This reduces memory footprint by ~30–40% compared to uncompressed 64-bit pointers.
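The encode/decode arithmetic can be sketched in plain Java. This assumes zero-based compressed oops (heap base at address 0) for simplicity; the class is illustrative, not a JVM API.

```java
public class CompressedOopsSketch {
    // With 8-byte object alignment, the low 3 address bits are always zero,
    // so a 64-bit address compresses to (address >>> 3) in 32 bits.
    static int encode(long address) {
        return (int) (address >>> 3);
    }

    // Decode scales the 32-bit value back up: 2^32 slots * 8-byte stride = 32 GB reach.
    static long decode(int narrow) {
        return (narrow & 0xFFFF_FFFFL) << 3;
    }

    public static void main(String[] args) {
        long addr = 0x7_FFFF_FFF8L;  // an 8-byte-aligned address just under 32 GB
        assert decode(encode(addr)) == addr;  // round-trips because addr is 8-aligned
    }
}
```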
```java
// Object size calculation example
class Example {
    int a;       // 4 bytes
    long b;      // 8 bytes
    byte c;      // 1 byte
    Object ref;  // 4 bytes (compressed oop)
}

// Layout (HotSpot field reordering):
// +0:  mark word      8 bytes
// +8:  klass pointer  4 bytes (compressed)
// +12: int a          4 bytes  ← a 4-byte field fills the gap after the klass pointer
// +16: long b         8 bytes  ← 8-byte fields need 8-byte alignment
// +24: byte c         1 byte
// +25: (padding)      3 bytes
// +28: ref (oop)      4 bytes (compressed)
// Total: 32 bytes
//
// Use JOL (Java Object Layout) to inspect:
// System.out.println(ClassLayout.parseClass(Example.class).toPrintable());
```
Identity Hash Code in Mark Word
The first call to System.identityHashCode(obj) (or obj.hashCode() if not overridden) causes the JVM to compute a hash and store it in the mark word. This is a one-time, lazy computation. The hash value is permanently embedded in the object header — once set, the object can never be biased-locked (the bits are occupied).
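The laziness is invisible from Java, but the stability is easy to check — once computed, the mark-word hash bits never change for the lifetime of the object:

```java
public class IdentityHashDemo {
    public static void main(String[] args) {
        Object o = new Object();
        int h1 = System.identityHashCode(o);  // first call: JVM computes and stores the hash
        int h2 = System.identityHashCode(o);  // later calls: just read the stored mark-word bits
        assert h1 == h2;
        // Object.hashCode() (not overridden here) returns the same identity hash
        assert o.hashCode() == h1;
    }
}
```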
Static vs Instance Variables — Memory Diagrams
```java
class Test {
    static int a = 42;         // class-level, stored in the Class object on the heap
    static String b = "hello"; // reference in the Class object; the String lives in the string pool
    int x;                     // per-instance, inside each object's heap allocation
    String name;               // per-instance reference
}
```
Key Rules
- Static variables live in the `java.lang.Class` mirror object on the heap (since Java 8). They remain reachable only as long as the Class object is reachable.
- Instance variables live inside each object allocation on the heap — zero-initialized when allocated, then set by constructors/initializers.
- Local variables (primitives) live in stack frame's local variable table — they are NOT initialized by default (must be explicitly assigned before use, enforced by the verifier).
- Local reference variables — the reference (pointer) is on the stack, but the object it points to is always on the heap.
| Variable Type | Where Stored | Default Value | GC Root? |
|---|---|---|---|
| Static primitive | Class object (heap) | 0 / false | Via Class object |
| Static reference | Class object (heap) | null | Via Class object |
| Instance primitive | Object body (heap) | 0 / false | Via enclosing object |
| Instance reference | Object body (heap) | null | Via enclosing object |
| Local primitive | Stack frame LVT | Undefined (error) | No (stack-scoped) |
| Local reference | Stack frame LVT | Undefined (error) | Yes (GC scans stacks) |
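The default-value rows can be checked directly (a minimal sketch; names illustrative): statics and instance fields arrive zeroed, while an unassigned local is a compile-time error at its first use.

```java
public class DefaultsDemo {
    static int staticInt;     // set to 0 during preparation
    static String staticRef;  // set to null during preparation
    int instanceInt;          // zeroed when the object is allocated

    public static void main(String[] args) {
        assert staticInt == 0 && staticRef == null;
        assert new DefaultsDemo().instanceInt == 0;

        int local;            // no default value exists for locals
        // System.out.println(local);  // javac: "variable local might not be initialized"
        local = 7;            // must assign before use — enforced by javac and the verifier
        assert local == 7;
    }
}
```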
Heap Architecture Internals
Generational Heap Layout
Object Allocation Fast Path
Most allocations use the bump-pointer allocator in the thread's TLAB (Thread-Local Allocation Buffer). This is an O(1) operation: increment a pointer, check against the TLAB limit, done. No locking required.
```cpp
// Pseudocode for the fast path (the real version is assembly in templateTable.cpp)
oop fast_allocate(size_t size) {
    oop result = tlab_top;              // current top of TLAB
    if (tlab_top + size <= tlab_end) {  // fits in TLAB?
        tlab_top += size;               // bump pointer
        memset(result, 0, size);        // zero-initialize
        return result;                  // done — no lock!
    }
    // TLAB exhausted: slow path (refill or allocate directly in Eden)
    return slow_allocate(size);
}
```
Promotion Rules
| Condition | What Happens |
|---|---|
| Object age >= MaxTenuringThreshold | Promoted to Old Gen |
| Survivor space >= 50% full (TargetSurvivorRatio) | Dynamic threshold lowering; some objects promoted early |
| Object too large for Eden (Humongous threshold) | G1: Allocated directly into Humongous region; CMS/Parallel: allocated in Old Gen |
| Old Gen GC pressure (Concurrent Mode Failure) | Fall-back Full GC (stop-the-world compaction) |
Thread-Local Allocation Buffers (TLAB)
TLABs are a critical performance optimization that allows threads to allocate memory without any synchronization. Each Java thread owns a private region of Eden space called a TLAB.
TLAB Sizing and Refill
TLAB size is adaptive. HotSpot tracks allocation rate per thread and adjusts TLAB size to balance between too-frequent refills (overhead) and too-large TLABs (waste/fragmentation). The target is approximately one TLAB refill per GC pause interval.
When a TLAB cannot satisfy an allocation, the JVM compares the TLAB's remaining space against a refill-waste limit (derived from -XX:TLABWasteTargetPercent). If the remainder is still large, the TLAB is kept and that one object is allocated directly in Eden. If the remainder is small, the JVM fills it with a "filler" object (so GC can still walk the heap linearly) and allocates a fresh TLAB from Eden using a CAS (compare-and-swap) on Eden's shared top pointer.
PLAB — Promotion-Local Allocation Buffers
During GC, surviving objects are copied to the Survivor or Old space. To avoid per-object CAS during copying, GC threads use PLABs — private buffers in the destination space. Multiple GC threads can work in parallel without contention. PLAB sizes are also adaptive.
| Flag | Default | Effect |
|---|---|---|
| `-XX:TLABSize` | Adaptive (~2KB–1MB) | Initial TLAB size per thread |
| `-XX:+ResizeTLAB` | `true` | Enable adaptive TLAB resizing |
| `-XX:TLABWasteTargetPercent` | 1% | Max Eden waste from unfilled TLABs |
| `-XX:+PrintTLAB` | `false` | Print TLAB statistics per GC |
Garbage Collection Algorithms — Deep Internals
Generational Hypothesis
The fundamental insight driving all generational collectors: most objects die young. Empirically, 80-98% of objects in typical Java workloads are unreachable after a single minor GC. This enables short, inexpensive minor GCs that reclaim most garbage without scanning long-lived objects.
Serial GC
Single-threaded stop-the-world collector. Uses mark-compact for Old Gen. Suitable for single-core machines and small heaps (<100MB). Activated with -XX:+UseSerialGC.
Parallel GC (Throughput Collector)
Multi-threaded stop-the-world. Uses parallel copying for Young Gen, parallel mark-compact for Old Gen. Optimizes for maximum throughput. Default in Java 8. Activated with -XX:+UseParallelGC. Key parameter: -XX:MaxGCPauseMillis (soft target).
G1 GC — Garbage First
Default collector since Java 9. Divides the heap into equal-sized regions (1–32MB, chosen at startup based on heap size). There are no fixed Young/Old partitions — regions are tagged as Eden, Survivor, or Old dynamically. Concurrent marking runs alongside application threads, and evacuation then targets the regions containing the most garbage first — hence "Garbage First".
ZGC — Z Garbage Collector
Low-latency collector targeting sub-millisecond pauses regardless of heap size (tested to 16TB). Uses colored pointers (load barriers) and region-based layout. Almost entirely concurrent — mark, relocate, and remap phases run concurrently with application threads. Stop-the-world pauses are <1ms.
Shenandoah GC
Also targeting low latency. Similar to ZGC in goals. Uses Brooks forwarding pointers — each object has an extra header field pointing to its current location. During concurrent compaction, the old copy's forwarding pointer redirects reads/writes to the new copy, enabling concurrent evacuation.
CMS (Concurrent Mark Sweep) — Deprecated
CMS was the first concurrent collector in HotSpot. It performs concurrent mark (with app threads running), then stop-the-world remark, then concurrent sweep. Does NOT compact — leads to heap fragmentation over time. Removed in Java 14 (-XX:+UseConcMarkSweepGC throws error).
GC Algorithm Comparison
| Collector | Pause Model | Throughput | Latency | Best Use Case |
|---|---|---|---|---|
| Serial | STW all phases | Low | High | Single-core, embedded, tiny heaps |
| Parallel | STW all phases (parallel) | High | Medium | Batch processing, throughput priority |
| G1 | Mostly concurrent + short STW | Good | Low | General purpose, >4GB heap |
| ZGC | Sub-ms STW | Good | Ultra-low | Latency-critical, huge heaps |
| Shenandoah | Sub-ms STW | Good | Ultra-low | Latency-critical, Red Hat systems |
GC Barriers — Write and Read Barriers
GC barriers are small code snippets injected by the JIT compiler (and interpreter) around heap read/write operations. They allow concurrent GC phases to maintain invariants without stopping the world.
Write Barrier — Card Marking
When an old-generation object's field is updated to point to a young-generation object, GC must know about it (otherwise the young-gen object could be collected because the old-gen reference isn't scanned during minor GC). The card table tracks which old-gen memory regions contain pointers to young-gen objects.
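A card-table barrier reduces to a shift and a byte store. The sketch below models it in Java — the 512-byte card size matches HotSpot's default, but the class itself is illustrative, not HotSpot API.

```java
import java.util.Arrays;

public class CardTableSketch {
    static final int CARD_SHIFT = 9;  // 2^9 = 512-byte cards (HotSpot's default)
    static final byte CLEAN = 1;
    static final byte DIRTY = 0;      // HotSpot also uses 0 for "dirty"

    final byte[] cards;
    final long heapBase;

    CardTableSketch(long heapBase, long heapSize) {
        this.heapBase = heapBase;
        this.cards = new byte[(int) (heapSize >>> CARD_SHIFT)];
        Arrays.fill(cards, CLEAN);
    }

    // Post-write barrier: after storing a reference into the field at fieldAddr,
    // dirty the card covering it. Minor GC then scans only dirty old-gen cards
    // for old→young pointers instead of the whole old generation.
    void postWriteBarrier(long fieldAddr) {
        cards[(int) ((fieldAddr - heapBase) >>> CARD_SHIFT)] = DIRTY;
    }
}
```

A write at heap offset 600 dirties card 1 (600 >> 9 = 1) and leaves card 0 clean.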
SATB Barrier — G1 / Shenandoah
Snapshot-At-The-Beginning (SATB): G1's concurrent marker takes a conceptual "snapshot" of the live object graph at the start of concurrent marking. As the application mutates the heap, any reference that is overwritten must be logged (to preserve the snapshot invariant). The write barrier logs old reference values to SATB queues.
```cpp
// SATB write barrier pseudocode (G1)
void satb_write_barrier(oop* field_addr) {
    oop old_value = *field_addr;
    if (marking_active && old_value != null) {
        // Log old value to SATB queue (to be rescanned)
        satb_queue.enqueue(old_value);
    }
    // Actual field write proceeds normally
}
```
Load Barrier — ZGC
ZGC uses load barriers on every reference load from the heap. The barrier checks colored pointer metadata bits to determine if the referenced object needs to be relocated. If yes, the barrier updates the reference in-place to point to the new location. This enables concurrent relocation without stopping the world.
```cpp
// ZGC load barrier pseudocode
oop load_barrier(oop* addr) {
    oop ref = *addr;
    if (ref.metadata_bits & BAD_COLOR_BITS) {
        // Object needs remapping or relocation
        ref = slow_path_fixup(addr, ref);
    }
    return ref;  // guaranteed to be in correct location
}
```
Performance Impact of Barriers
Barriers add overhead to every heap read or write. ZGC's load barrier adds ~4ns overhead per reference load. G1's write barrier (card marking + SATB) adds ~2-5ns per write. JIT compilers inline barriers and apply optimizations to eliminate redundant barrier checks.
Safepoints — Stopping the World
A safepoint is a point in execution where all JVM threads are in a known, consistent state — enabling the JVM to perform operations that require exclusive access to the heap (GC, deoptimization, stack scanning, class redefinition).
How Safepoints Work
Safepoint Polling Mechanism
Modern HotSpot uses a dedicated polling page in virtual memory. Under normal execution, this page is readable (poll = read from page = no-op). When a safepoint is needed, the JVM makes this page inaccessible using mprotect(). Threads reading the page receive a SIGSEGV, which the JVM signal handler converts into a safepoint block.
```
;; JIT-generated safepoint poll (x86-64, conceptual),
;; emitted at method returns and loop back-edges:
test   eax, DWORD PTR [poll_page]   ; read from the polling page
;; If the page is readable: effectively a no-op, execution continues
;; If the page is protected: SIGSEGV → JVM signal handler blocks the thread at a safepoint
```
Deoptimization at Safepoints
When a JIT optimization is invalidated (e.g., an inlined virtual call's target set changes because a new class was loaded), the JVM schedules a deoptimization. At the next safepoint, compiled frames are rewritten into equivalent interpreter frames, and execution resumes in the interpreter at the exact bytecode position where the deopt occurred. (The reverse transition — jumping from the interpreter into compiled code in the middle of a hot loop — is on-stack replacement, OSR.)
Time-to-Safepoint Latency
A critical GC performance metric is "time to safepoint" — the time from requesting a safepoint to all threads reaching it. Causes of long TTSP: tight loops with no safepoint polls (rare in modern HotSpot), JNI-heavy workloads, or very long-running native calls. Monitor with -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=200.
JIT Compiler Architecture — C1 and C2
HotSpot has two JIT compilers that cooperate via tiered compilation. Understanding their compilation pipelines is essential for performance tuning and debugging unexpected slowdowns.
Tiered Compilation Levels
| Tier | Executor | Optimization Level | Profiling? |
|---|---|---|---|
| 0 | Template Interpreter | None | Method invocation + back-edge counters |
| 1 | C1 | Simple (no profiling) | No |
| 2 | C1 | Limited | Invocation + back-edge counters only |
| 3 | C1 | Full C1 | Full profiling: branch stats, type profiles, call target profiles |
| 4 | C2 | Maximum | No (uses Tier 3 profile data) |
C1 Compiler Pipeline
C2 Compiler Pipeline
Compilation Thresholds
```
// With TieredCompilation (default since Java 8)
// Interpreter → C1 Tier 3 when:
//   invocation_count > CompileThreshold * InterpreterProfilePercentage / 100
//   back_edge_count  > OnStackReplacePercentage * CompileThreshold / 100
//   (defaults: CompileThreshold=10000, InterpreterProfilePercentage=33)
//
// C1 Tier 3 → C2 Tier 4 when:
//   C1-profiled invocation count exceeds its threshold
//   (roughly 15,000 invocations total)
//
// To force immediate C2 compilation (testing only):
//   -XX:CompileThreshold=1 -XX:-TieredCompilation
```
Inline Cache and Megamorphic Call Sites
For invokevirtual, a call site that has only ever seen one receiver type is monomorphic — the JIT installs an inline cache for that single target and can inline it. If a second type appears, the site becomes bimorphic (a two-way type check and branch). With three or more observed types it becomes megamorphic — the JIT gives up on inlining and falls back to full vtable dispatch. Megamorphic call sites are a significant optimization barrier; keeping hot call sites monomorphic is critical for performance-sensitive code.
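A sketch of the distinction (types are illustrative): the single `s.area()` call site below is monomorphic when fed one concrete type and megamorphic when fed three or more, even though the source code is identical.

```java
interface Shape { double area(); }

final class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

final class Square implements Shape {
    final double s;
    Square(double s) { this.s = s; }
    public double area() { return s * s; }
}

public class CallSiteDemo {
    // s.area() is ONE call site. Fed only Squares it stays monomorphic and the
    // JIT inlines Square.area(); fed three or more implementing types it goes
    // megamorphic and every call pays an itable dispatch instead.
    static double sumAreas(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) sum += s.area();
        return sum;
    }

    public static void main(String[] args) {
        Shape[] monomorphic = { new Square(2), new Square(3) };
        assert sumAreas(monomorphic) == 13.0;  // 4 + 9
    }
}
```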
JIT Optimizations — Deep Dive
Method Inlining
The most impactful JIT optimization. When a callee method is inlined, the call overhead disappears and the combined code graph enables further optimizations (constant folding, dead code elimination, etc.).
```java
// Before inlining
int result = obj.getValue();              // method call overhead
int getValue() { return this.value; }

// After inlining (in C2's IR)
int result = obj.value;                   // direct field access, no call
```
C2 inlines methods up to -XX:MaxInlineSize (default 35 bytecodes) and -XX:InlineSmallCode (default 1000 bytes of compiled code). Recursive inlining is bounded by -XX:MaxInlineLevel (default 9 levels).
Dead Code Elimination (DCE)
C2's sea-of-nodes IR naturally eliminates unreachable nodes. If a condition is provably always true/false (from profiling), the dead branch is eliminated. Example: after type-checking inlining, null checks on provably non-null references are eliminated.
Constant Folding and Propagation
```java
final int X = 10;
int y = X * 3;   // folded to: int y = 30; at compile time
if (y > 20)      // folded to: if (true) → dead else-branch eliminated
```
Loop Unrolling
Short loops with small fixed iteration counts are unrolled — the loop body is repeated N times with the loop control reduced or eliminated. This reduces branch prediction pressure and enables SIMD vectorization.
```java
// Source
for (int i = 0; i < 4; i++) arr[i] = i * 2;

// After loop unrolling (4x):
arr[0] = 0; arr[1] = 2; arr[2] = 4; arr[3] = 6;
// No loop overhead at all!
```
Null Check Elimination
After a null check in one code path, C2 tracks that the reference is non-null on the path where the check passed, eliminating subsequent redundant null checks. This is a form of flow-sensitive type refinement.
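A sketch of the pattern (the method name is invented): one explicit check dominates all later uses, so the implicit null checks guarding the subsequent calls can be dropped:

```java
public class NullCheckDemo {
    static int describe(String s) {
        if (s == null) {
            return -1;     // explicit check: s is proven null on this path
        }
        // On this path C2 knows s != null, so the implicit null checks
        // guarding the two member calls below are redundant and eliminated.
        return s.length() + s.hashCode() % 7;
    }

    public static void main(String[] args) {
        System.out.println(describe(null));    // -1
        System.out.println(describe("jvm"));   // length plus a hash-derived term
    }
}
```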
Range Check Elimination (RCE)
For array accesses inside loops with provably bounded indices, C2 hoists the range check outside the loop (checked once before the loop starts, not on every iteration):
```java
// Before RCE:
for (int i = 0; i < arr.length; i++) {
    // implicit: if (i < 0 || i >= arr.length) throw AIOOBE
    sum += arr[i];
}

// After RCE: check hoisted, loop body is bounds-check-free
if (arr.length > 0) {
    for (int i = 0; i < arr.length; i++) {
        sum += arr[i];   // no bounds check!
    }
}
```
Escape Analysis — Stack Allocation and Scalar Replacement
Escape analysis determines whether an object allocated in a method can "escape" that method (i.e., be referenced from outside). Objects that do not escape can be subject to powerful optimizations.
Escape States
| State | Meaning | Optimization Possible |
|---|---|---|
| NoEscape | Object only used within the allocating method; never passed to another method or stored in a field | Scalar replacement + Stack allocation |
| ArgEscape | Object passed to methods but those methods don't store it globally | Lock elimination |
| GlobalEscape | Object stored in static field, returned from method, or passed to native | None — must heap allocate |
Scalar Replacement
When an object is NoEscape, C2 can decompose it into its individual fields as scalar values (removing the object entirely). This eliminates heap allocation, reduces GC pressure, and allows the scalar values to live in CPU registers.
```java
class Point { int x, y; }

int sumCoords() {
    Point p = new Point();   // NoEscape: p never leaves this method
    p.x = 3;
    p.y = 4;
    return p.x + p.y;
}

// After scalar replacement — NO heap allocation:
int sumCoords() {
    int px = 3;         // scalar: p.x
    int py = 4;         // scalar: p.y
    return px + py;     // constant-folded to: return 7
}
```
Lock Elimination
If an object is NoEscape or ArgEscape, its monitor lock can never be contended — no other thread can see it. C2 eliminates synchronized blocks on such objects entirely:
```java
synchronized (new Object()) {   // lock on NoEscape object
    doWork();                   // synchronized block eliminated by JIT!
}

// Also applies to StringBuffer (synchronized) operations
// if the StringBuffer doesn't escape the method:
StringBuffer sb = new StringBuffer();
sb.append("a").append("b");     // locks eliminated — sb is NoEscape
```
Escape Analysis Limits
Escape analysis in C2 is interprocedural within the inlining budget. Objects that escape through methods not inlined (too large, recursive, megamorphic) are not candidates. JVM flags:
```
-XX:+DoEscapeAnalysis      (default: true in Java 8+)
-XX:+EliminateAllocations  (default: true — enable scalar replacement)
-XX:+EliminateLocks        (default: true — enable lock elimination)
-XX:+PrintEscapeAnalysis   (debug: print EA results)
```
Vectorization — SIMD Optimizations
Modern CPUs support SIMD (Single Instruction, Multiple Data) instructions — SSE2/SSE4, AVX/AVX2/AVX-512 on x86, NEON on AArch64. These instructions operate on 128–512 bit wide registers, processing 4–16 integers (or 2–8 doubles) per instruction. C2 auto-vectorizes certain loop patterns.
Auto-Vectorization Conditions
For C2 to vectorize a loop:
- Loop must have a countable iteration count
- No loop-carried dependencies (iterations must be independent)
- Array accesses must be sequential (stride-1)
- Operations must be vectorizable (arithmetic, comparisons, simple logical)
```java
// Vectorizable loop:
int[] a = ..., b = ..., c = ...;
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];     // independent iterations — vectorized!
}

// C2 may emit (AVX2, processes 8 ints at a time):
//   vmovdqu ymm0, [a + i*4]    ; load 8 ints from a
//   vmovdqu ymm1, [b + i*4]    ; load 8 ints from b
//   vpaddd  ymm2, ymm0, ymm1   ; add 8 ints in parallel
//   vmovdqu [c + i*4], ymm2    ; store 8 ints to c

// NOT vectorizable (loop-carried dependency):
for (int i = 1; i < n; i++) {
    a[i] = a[i-1] + b[i];      // depends on previous iteration
}
```
Panama Vector API (incubating since Java 16)
The Vector API (jdk.incubator.vector) provides explicit SIMD programming in Java. Unlike best-effort auto-vectorization, Vector API operations compile reliably to hardware vector instructions on supported CPUs, with a graceful scalar fallback elsewhere:
```java
import jdk.incubator.vector.*;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;   // 256-bit = 8 ints

void addArrays(int[] a, int[] b, int[] c) {
    int i = 0;
    int upper = SPECIES.loopBound(a.length);        // largest multiple of the lane count
    for (; i < upper; i += SPECIES.length()) {
        IntVector va = IntVector.fromArray(SPECIES, a, i);
        IntVector vb = IntVector.fromArray(SPECIES, b, i);
        va.add(vb).intoArray(c, i);                 // SIMD add
    }
    for (; i < a.length; i++) {                     // scalar tail for the remainder
        c[i] = a[i] + b[i];
    }
}
```
Java Memory Model — Happens-Before and Visibility
The Java Memory Model (JMM), specified in JSR-133, defines the semantics of multithreaded Java programs. Without the JMM, compilers and CPUs can reorder memory operations in ways that produce counterintuitive results.
Memory Visibility Problem
```java
// Without synchronization — BROKEN
int data = 0;
boolean flag = false;

// Thread 1:
data = compute();    // (1)
flag = true;         // (2)

// Thread 2:
while (!flag) { }    // spin (3)
use(data);           // (4) MAY SEE STALE DATA!

// Thread 2 could see flag=true but data still 0:
//  - the compiler/CPU can reorder (1) and (2)
//  - the CPU's store buffer may not flush to cache in order
```
Happens-Before Rules
Action A happens-before action B means: A's effects are guaranteed to be visible to B. The JMM defines the following happens-before edges:
| Rule | HB Edge |
|---|---|
| Program order | Within a single thread: A before B in source → A HB B |
| Monitor lock | unlock(m) HB every subsequent lock(m) on the same monitor |
| Volatile write | volatile write to v HB every subsequent volatile read of v |
| Thread start | Thread.start() HB all actions in the started thread |
| Thread join | All actions in T HB T.join() returning |
| Object finalization | End of constructor HB start of finalizer |
| Transitivity | If A HB B and B HB C, then A HB C |
Volatile Semantics
A volatile field provides two guarantees: visibility (every read sees the last write) and ordering (no reordering across a volatile access). The JIT must not cache volatile variables in registers and must emit memory fences.
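A minimal runnable sketch of the publication guarantee (class and field names are illustrative): a plain field is safely published through a volatile flag, because the volatile write happens-before the volatile read that observes it.

```java
public class VolatileFlag {
    static int data = 0;                    // plain field
    static volatile boolean flag = false;   // publication flag
    static volatile int observed = -1;      // what the reader saw

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            data = 42;       // (1) plain write
            flag = true;     // (2) volatile write — publishes (1)
        });
        Thread reader = new Thread(() -> {
            while (!flag) { Thread.onSpinWait(); }   // (3) spin on volatile read
            observed = data;                         // (4) guaranteed to see 42
        });
        reader.start();
        writer.start();
        writer.join();
        reader.join();
        System.out.println(observed);   // 42
    }
}
```

Remove `volatile` from `flag` and the reader may spin forever or observe a stale `data` — the exact failure mode shown in the broken example above.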
Double-Checked Locking — The Classic Trap
```java
// BROKEN before Java 5 / without volatile:
private static Singleton instance;

public static Singleton getInstance() {
    if (instance == null) {
        synchronized (Singleton.class) {
            if (instance == null)
                instance = new Singleton();   // UNSAFE!
        }
    }
    return instance;
}

// Problem: "new Singleton()" is NOT atomic:
//   1. Allocate memory for Singleton
//   2. Call constructor (initialize fields)
//   3. Assign reference to instance
// Steps 2 and 3 CAN be reordered by CPU/compiler!
// Thread B may see a non-null but uninitialized instance.

// CORRECT — volatile prevents reordering:
private static volatile Singleton instance;
// the volatile write in step 3 cannot be reordered with step 2
```
Memory Barrier Types
| Barrier | Prevents | JMM Use Case |
|---|---|---|
| LoadLoad | Load reordering with prior load | Volatile read |
| LoadStore | Store reordering with prior load | Volatile read |
| StoreStore | Store reordering with prior store | Before volatile write |
| StoreLoad | Load reordering with prior store | After volatile write (strongest; prevents all reordering) |
JVM Synchronization Internals
Every Java object is a potential lock (monitor). The JVM implements a three-tier locking strategy to minimize overhead for the common case (no contention).
Lock State Transitions
Lightweight Locking
When a thread enters a synchronized block (and biased locking is not applicable), it performs a CAS to atomically swap the mark word: it stores the original mark word in a lock record on its stack, then sets the mark word to point to this lock record (with lock bits = 00). If the CAS succeeds, the thread owns the lock. On exit, CAS swaps back. This is O(1) and requires no OS kernel involvement.
```
// Lightweight lock entry (monitorenter bytecode), pseudocode:
lock_record.displaced_header = object->mark_word()   // save original
if (CAS(object->mark_word, original, ptr_to_lock_record | LOCKED_BITS)) {
    // CAS succeeded: thread owns the lock
} else if (is_our_lock_record(object->mark_word)) {
    // Recursive lock: increment recursion count
} else {
    // Contention: inflate to heavyweight (ObjectMonitor)
    inflate_and_enter(object);
}
```
ObjectMonitor — Heavyweight Lock Internals
When a lock is inflated, HotSpot allocates an ObjectMonitor C++ object. This contains: the owner thread, an entry list (waiting to acquire), a wait set (threads in Object.wait()), and a recursion counter.
```cpp
// ObjectMonitor key fields (hotspot/src/share/vm/runtime/objectMonitor.hpp)
class ObjectMonitor {
    volatile markWord    _header;       // saved displaced mark word
    volatile JavaThread* _owner;        // owning thread
    volatile intptr_t    _recursions;   // recursion depth
    ObjectWaiter*        _EntryList;    // threads blocked on entry
    ObjectWaiter*        _WaitSet;      // threads in wait()
    volatile int         _waiters;      // count of wait() callers
};
```
Biased Locking Removed (Java 21)
Biased locking was deprecated and disabled by default in Java 15 (JEP 374) and later removed entirely. Modern hardware CAS operations are cheap enough that the optimization's benefit (eliminating the CAS on uncontended lock acquisition) was outweighed by the cost of revoking bias at safepoints when contention did occur.
On older JVMs, -XX:-UseBiasedLocking disabled biased locking to avoid bias-revocation overhead; in Java 21+ the flag no longer exists. Measure lock contention with jstack or JFR monitor-contention events.
String Pool Internals
Java's String class is immutable, making sharing safe. The JVM maintains a String Constant Pool (also called the String intern table) in the heap (since Java 7) backed by a fixed-size hash table.
How String Literals Are Interned
intern() and == vs equals()
```java
String s1 = "Java";                  // interned, pool ref
String s2 = "Java";                  // same pool ref as s1
String s3 = new String("Java");     // new heap object, NOT pool
String s4 = s3.intern();             // returns pool ref

System.out.println(s1 == s2);        // TRUE  — same pool object
System.out.println(s1 == s3);        // FALSE — different objects
System.out.println(s1 == s4);        // TRUE  — s4 is pool ref
System.out.println(s1.equals(s3));   // TRUE  — same content
```
String Concatenation with invokedynamic (Java 9+)
Since Java 9, string concatenation using + uses invokedynamic with StringConcatFactory (rather than StringBuilder as in Java 8). This enables JVM-level optimization of string building strategies at runtime.
```java
// Java source:
String s = name + " has age " + age;

// Java 8 bytecode (old StringBuilder approach):
//   new StringBuilder → append(name) → append(" has age ") → append(age) → toString()

// Java 9+ bytecode (invokedynamic approach):
//   invokedynamic #1 (StringConcatFactory.makeConcatWithConstants)
//   Bootstrap: generates an optimized concatenation strategy at runtime
```
String Deduplication (G1 GC)
With -XX:+UseStringDeduplication (G1 only), the GC deduplicates the backing arrays of Strings with identical content (char[] before Java 9, byte[] with compact strings since). This doesn't affect String object identity — distinct String objects remain distinct — but their backing arrays are shared, saving memory in workloads with many duplicate strings.
Machine Code Generation — From IR to Assembly
C2's backend translates its Ideal graph (sea of nodes) through a series of steps to produce native machine code. Understanding this helps diagnose JIT compilation failures and unexpected performance characteristics.
Register Allocation
C2 uses graph coloring register allocation. Variables (IR nodes) are mapped to machine registers; when there are more live values than registers, the allocator spills to the stack. The register allocation quality directly determines instruction density and spill overhead.
Generated Assembly Example
```
// Java source:
public static long fibonacci(int n) {
    if (n <= 1) return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}

// C2-generated x86_64 (simplified, after inlining the base case;
// register save/restore around the recursive calls elided):
fibonacci:
    push  rbp                ; prolog: stack frame setup
    mov   rbp, rsp
    cmp   edi, 1             ; check n <= 1
    jle   .base_case
    lea   edi, [rdi - 1]     ; fibonacci(n-1)
    call  fibonacci
    mov   rbx, rax           ; save result
    lea   edi, [rdi - 2]     ; fibonacci(n-2)
    call  fibonacci
    add   rax, rbx           ; sum results
    pop   rbp
    ret
.base_case:
    movsx rax, edi
    pop   rbp
    ret
```
Instruction Scheduling
C2 schedules instructions to hide CPU pipeline latency. For example, a load (3–4 cycle latency) followed by an immediate use creates a pipeline stall. C2 inserts independent instructions between the load and its first use to overlap execution. On out-of-order CPUs (all modern x86/ARM), the CPU also performs hardware instruction scheduling.
PrintAssembly — Viewing JIT Output
# Requires hsdis library for disassembly
java -XX:+PrintCompilation \
-XX:+UnlockDiagnosticVMOptions \
-XX:+PrintAssembly \
-XX:CompileCommand=print,MyClass.myMethod \
MyClass
JVM Performance Engineering
Essential JVM Flags Reference
| Flag | Purpose | Typical Value |
|---|---|---|
| -Xms | Initial heap size | -Xms2g (set equal to -Xmx to avoid resizing) |
| -Xmx | Maximum heap size | -Xmx8g |
| -Xss | Thread stack size | -Xss512k (reduce for many threads) |
| -XX:MetaspaceSize | Initial Metaspace size | -XX:MetaspaceSize=256m |
| -XX:MaxMetaspaceSize | Metaspace ceiling (important!) | -XX:MaxMetaspaceSize=512m |
| -XX:+UseG1GC | Use G1 collector | Default Java 9+ |
| -XX:+UseZGC | Use ZGC collector | Low-latency apps |
| -XX:MaxGCPauseMillis | G1 pause target (soft goal) | -XX:MaxGCPauseMillis=200 |
| -XX:GCTimeRatio | Throughput ratio (1/(1+ratio)) | -XX:GCTimeRatio=9 (10% GC overhead max) |
| -XX:+PrintGCDetails | Verbose GC logging (pre-Java 9) | Use in staging/production |
| -Xlog:gc*:file=/path/gc.log | Unified GC logging (Java 9+) | Always enable in production |
| -XX:ReservedCodeCacheSize | JIT code cache size | -XX:ReservedCodeCacheSize=512m |
| -XX:+TieredCompilation | Enable tiered JIT (default) | Leave enabled |
| -XX:+HeapDumpOnOutOfMemoryError | Dump heap on OOM | Always enable in production |
| -XX:HeapDumpPath=/dumps/ | Where to write heap dump | On fast disk |
GC Tuning Strategy
Step 1: Define your goals — throughput, latency, footprint (pick two).
Step 2: Choose the right collector (G1 for general purpose, ZGC/Shenandoah for latency, Parallel for throughput).
Step 3: Set the heap size (-Xms = -Xmx to avoid heap resizing pauses).
Step 4: Tune GC-specific parameters.
Step 5: Measure with a real workload, and iterate.
Monitoring Tools
| Tool | Command | What It Shows |
|---|---|---|
| jstat | jstat -gcutil <pid> 1000 | GC statistics: heap usage %, GC count, GC time per second |
| jmap | jmap -heap <pid> | Heap configuration and usage; -histo for object histogram |
| jstack | jstack <pid> | Thread dumps: detect deadlocks, hotspots, blocked threads |
| jcmd | jcmd <pid> VM.flags | JVM flags, GC stats, heap info, thread info, JFR control |
| Java Flight Recorder | jcmd <pid> JFR.start duration=60s filename=recording.jfr | CPU, memory, GC, I/O, lock contention profiling with <1% overhead |
| JDK Mission Control | GUI for JFR analysis | Flame graphs, GC analysis, lock analysis, allocation profiling |
| async-profiler | ./profiler.sh -d 30 -f profile.html <pid> | CPU/allocation/lock profiling via AsyncGetCallTrace (no safepoint bias) |
Finding Memory Leaks
```
# 1. Take heap dump
jcmd <pid> GC.heap_dump /tmp/heap.hprof

# 2. Analyze with Eclipse Memory Analyzer (MAT)
#    - Open the heap dump
#    - Run "Leak Suspects Report"
#    - Inspect the dominator tree — largest retained heaps
#    - Find ClassLoader leaks via "Class Loader Explorer"

# Common leak patterns:
#   1. Static collections holding object references (static Map, List)
#   2. Listener/callback registrations never deregistered
#   3. ThreadLocal values not removed (especially in thread pools)
#   4. Inner class references to outer class (anonymous listeners)
#   5. ClassLoader leaks in hot-deploy scenarios
```
JVM Failure Analysis
StackOverflowError
Thrown when the JVM thread stack reaches its maximum size (-Xss). Each method call adds a stack frame; deeply recursive methods or infinite recursion exhaust the stack.
```java
// Cause: infinite recursion
void recurse() { recurse(); }   // StackOverflowError after ~1,000–10,000 frames

// Cause: legitimate deep recursion on large inputs
// Fix: convert to iterative + an explicit Deque/Stack data structure

// Diagnosis:
//   jstack shows all frames in the thread that overflowed —
//   look for the repeating frame pattern
```
OutOfMemoryError Variants
| OOM Message | Root Cause | Diagnosis |
|---|---|---|
| Java heap space | Heap exhausted; too many live objects | Heap dump + MAT; check -Xmx; find leaks |
| GC overhead limit exceeded | JVM spending >98% time in GC reclaiming <2% of heap | Increase heap; find allocation hotspots with JFR |
| Metaspace | Metaspace exhausted; too many loaded classes | jcmd <pid> VM.classloaders; check for ClassLoader leaks |
| unable to create new native thread | OS thread limit or process memory exhausted | Reduce -Xss; reduce thread count; check ulimits |
| Direct buffer memory | Off-heap direct ByteBuffer space exhausted | Increase -XX:MaxDirectMemorySize; find unreleased buffers |
| Code Cache | JIT code cache full; JIT compilation disabled | Increase -XX:ReservedCodeCacheSize; look for code cache flushing |
Metaspace Leak — ClassLoader Leak Pattern
```java
// Leak pattern: ClassLoader held alive by a pooled thread
Thread t = new Thread(task);
t.setContextClassLoader(customCL);   // thread holds a reference to the CL
t.start();
// If the thread stays alive in a thread pool, customCL
// (and every class it loaded) is retained.

// Fix: restore the context ClassLoader before returning the thread to the pool
try {
    task.run();
} finally {
    Thread.currentThread().setContextClassLoader(originalCL);
}
```
Diagnosing Long GC Pauses
```
# Enable detailed GC logging
-Xlog:gc+phases*=debug:file=gc.log:time,uptime:filecount=5,filesize=20m

# Common causes of long pauses:
# 1. Huge survivor spaces → large copy overhead
#    Fix: -XX:SurvivorRatio, -XX:MaxTenuringThreshold
# 2. Long time-to-safepoint
#    Monitor with: -Xlog:safepoint (Java 9+) or -XX:+PrintSafepointStatistics
# 3. G1 humongous allocations causing early mixed GCs
#    Fix: increase -XX:G1HeapRegionSize
# 4. Evacuation failure (G1) / concurrent mode failure
#    Fix: increase heap or lower -XX:InitiatingHeapOccupancyPercent
```
JVM Debugging Tools — Complete Reference
jstack — Thread Analysis
```
jstack <pid>              # thread dump to stdout
jstack -l <pid>           # + lock information
kill -3 <pid>             # trigger thread dump via signal
jcmd <pid> Thread.print   # alternative via jcmd

Thread states in jstack output:
  RUNNABLE      - executing or ready to run on CPU
  BLOCKED       - waiting for a monitor lock
  WAITING       - Object.wait(), Thread.join(), LockSupport.park()
  TIMED_WAITING - same but with timeout
  NEW           - not yet started
  TERMINATED    - finished execution
```
jmap — Heap Analysis
```
jmap -heap <pid>         # heap summary (generation sizes)
jmap -histo <pid>        # object histogram (class → count, bytes)
jmap -histo:live <pid>   # force GC first, then histogram of live objects only
jmap -dump:live,format=b,file=h.hprof <pid>   # heap dump
```
jcmd — Swiss Army Knife
```
jcmd <pid> help                        # list all available commands
jcmd <pid> VM.flags                    # all JVM flags (including defaults)
jcmd <pid> VM.system_properties        # system properties
jcmd <pid> VM.version                  # JVM version info
jcmd <pid> GC.run                      # trigger GC (hint)
jcmd <pid> GC.heap_info                # heap usage
jcmd <pid> GC.heap_dump /tmp/h.hprof   # heap dump
jcmd <pid> Thread.print                # thread dump
jcmd <pid> VM.classloaders             # ClassLoader hierarchy
jcmd <pid> Compiler.queue              # JIT compilation queue
jcmd <pid> Compiler.codecache          # code cache usage

# Java Flight Recorder
jcmd <pid> JFR.start name=myRec duration=60s filename=rec.jfr
jcmd <pid> JFR.dump name=myRec filename=rec.jfr
jcmd <pid> JFR.stop name=myRec
```
Java Flight Recorder (JFR)
JFR is a production-safe, low-overhead profiler built into the JVM. It records time-series data about JVM and application events: CPU usage, garbage collection, JIT compilation, I/O, lock acquisition, memory allocation (with stack traces), exceptions, and custom application events. Overhead is <1% CPU in most workloads.
| JFR Event Category | What It Reveals |
|---|---|
| GC events | Pause times, GC causes, before/after heap sizes, phase breakdown |
| JIT compilation | Method compilation times, code cache usage, deoptimization events |
| Allocation profiling | Top allocating methods and classes (with stack traces) |
| Lock contention | Locks with highest contention, average wait time, waiters |
| CPU/Method profiling | Hot methods consuming CPU (async sampling) |
| I/O profiling | File and network read/write latency breakdown |
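Recordings can also be driven programmatically through the jdk.jfr API (Java 11+). A minimal sketch — jdk.GarbageCollection is a built-in JFR event name, and the output filename is illustrative:

```java
import jdk.jfr.Recording;
import java.nio.file.Path;

public class JfrDemo {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            recording.enable("jdk.GarbageCollection");   // built-in GC event
            recording.start();

            // Allocate some garbage so there is activity to record
            byte[][] garbage = new byte[100][];
            for (int i = 0; i < 100; i++) garbage[i] = new byte[1 << 16];
            System.gc();   // hint: encourage a GC event

            recording.stop();
            Path out = Path.of("recording.jfr");   // illustrative path
            recording.dump(out);                   // write the .jfr file
            System.out.println("wrote " + out);
        }
    }
}
```

The resulting file opens in JDK Mission Control or can be printed with `jfr print recording.jfr`.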
Complete JVM Memory Map
Object Cross-Reference Map
Bytecode to CPU — Full Execution Pipeline
When JIT-compiled code executes on the CPU, it passes through the CPU's own pipeline stages. Understanding CPU execution behavior is essential for the highest-level JVM performance work.
CPU Execution Pipeline (Modern Out-of-Order x86_64)
Branch Prediction and JVM Code
JVM bytecode dispatch in the template interpreter relies heavily on branch prediction. Hot loops in JIT code are predictable (same branch taken every iteration). Virtual dispatch through vtables is predictable for monomorphic call sites. The JIT generates type-check inlining guards that are predicted well by the CPU branch predictor.
Cache Effects on Java Object Graphs
Java's object-per-allocation model creates pointer-heavy data structures. Traversing an array of object references involves a pointer indirection per element — each reference load can miss the L1 cache. Key optimization: use value-type arrays (int[], long[]) or off-heap layouts to enable sequential memory access and CPU cache line prefetching.
```java
// SLOW: pointer indirection, random cache misses
Integer[] boxed = new Integer[1_000_000];
java.util.Arrays.fill(boxed, 1);          // filled so elements are non-null
long sumBoxed = 0;
for (Integer i : boxed) sumBoxed += i;    // each access = potential cache miss

// FAST: sequential memory, prefetched by CPU
int[] primitives = new int[1_000_000];
long sumPrim = 0;
for (int i : primitives) sumPrim += i;    // cache-line friendly, vectorizable
```
JIT Deoptimization and Branch Prediction
When C2 deoptimizes a method (due to class loading invalidating an inlined assumption), control returns to the interpreter at a specific bytecode. Frequent deoptimizations can harm branch predictor state. Monitor deoptimizations with -XX:+PrintCompilation (look for "made not entrant" entries) or JFR's deoptimization events.
Advanced JVM Interview Questions & Traps
Class Initialization Traps
static final int X = 5 is inlined by javac at every use site. Accessing it never loads or initializes the declaring class.
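This is directly observable. In the sketch below (class names invented), reading Consts.X never runs Consts's static initializer, because javac copies the constant value into the caller's bytecode (JLS 12.4.1):

```java
class InitLog {
    static boolean constsInitialized = false;   // tracked outside Consts
}

class Consts {
    static final int X = 5;   // compile-time constant — inlined at use sites
    static { InitLog.constsInitialized = true; }
}

public class ConstantTrapDemo {
    public static void main(String[] args) {
        // javac folds this to the literal 10 — the emitted bytecode
        // contains no reference to Consts at all.
        int y = Consts.X * 2;
        System.out.println(y + ", Consts initialized? " + InitLog.constsInitialized);
        // prints: 10, Consts initialized? false
    }
}
```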
Child.parentStaticField initializes Parent, NOT Child. The JLS says initialization happens on the declaring class.
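A runnable sketch of this rule (class names invented): the getstatic resolves to the field's declaring class, so only Parent is initialized even though the access is written through Child:

```java
class InitTracker {
    static boolean parentInit = false;
    static boolean childInit = false;
}

class Parent {
    static int parentStaticField = 7;   // not final — not a compile-time constant
    static { InitTracker.parentInit = true; }
}

class Child extends Parent {
    static { InitTracker.childInit = true; }
}

public class InitTrapDemo {
    public static void main(String[] args) {
        // Written through Child, but the field is declared in Parent,
        // so only Parent's <clinit> runs (JVMS §5.5).
        int v = Child.parentStaticField;
        System.out.println(v + ", parent=" + InitTracker.parentInit
                             + ", child=" + InitTracker.childInit);
        // prints: 7, parent=true, child=false
    }
}
```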
If <clinit> throws, the class is marked as failed. Every subsequent use of that class throws NoClassDefFoundError (not ExceptionInInitializerError — that is thrown only on the first attempt).
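A runnable sketch of the two-phase failure (class names invented; the `if (true)` wrapper is the standard trick that lets a static initializer throw and still compile):

```java
class Broken {
    static { if (true) throw new RuntimeException("boom"); }
    static int value = 1;
}

public class ClinitFailureDemo {
    static boolean sawEIIE, sawNCDFE;

    public static void main(String[] args) {
        try {
            int v = Broken.value;   // first active use — runs <clinit>, which throws
        } catch (ExceptionInInitializerError e) {
            sawEIIE = true;         // e.getCause() is the original RuntimeException
        }
        try {
            int v = Broken.value;   // class is already marked as failed
        } catch (NoClassDefFoundError e) {
            sawNCDFE = true;
        }
        System.out.println(sawEIIE + " " + sawNCDFE);   // true true
    }
}
```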
String Pool Traps
```java
String s1 = "a" + "b" + "c";       // compile-time: "abc" literal — pooled
String s2 = "abc";                 // same pool object as s1
System.out.println(s1 == s2);      // TRUE

String a = "ab";
String b = a + "c";                // runtime concat — NOT pooled
System.out.println(b == "abc");    // FALSE

final String a2 = "ab";            // compile-time constant
String b2 = a2 + "c";              // compile-time concat → "abc" literal
System.out.println(b2 == "abc");   // TRUE — javac inlines the concat
```
Object Reference vs Object
```java
void modify(Object o) {
    o = new Object();   // does NOT affect the caller's reference
}
// Java passes references BY VALUE — you can't change the caller's variable,
// but you CAN mutate the object the reference points to.
```
JVM Advanced Q&A
| Question | Answer |
|---|---|
| Where are static variables stored in Java 8+? | In the java.lang.Class mirror object, which is on the heap. Not in Metaspace. Not on the stack. |
| Can the JVM collect a class from Metaspace? | Yes, if the ClassLoader that loaded it becomes unreachable. Then all classes it loaded (and their Metaspace data) are freed. |
| Does System.gc() guarantee collection? | No. It's a hint. The JVM may ignore it. Use -XX:+ExplicitGCInvokesConcurrent to make it trigger G1 concurrent cycle. |
| What is the difference between == and equals() for Integer? | Integer caches values -128 to 127 in an IntegerCache. Integer.valueOf(100) == Integer.valueOf(100) is true (cached). Integer.valueOf(200) == Integer.valueOf(200) is false (outside cache range). Always use equals() for Integer comparison. |
| What is a safepoint? | A point in execution where all Java threads are paused and the JVM has exclusive, consistent access to the heap. Required for GC, deoptimization, class redefinition, and stack sampling. |
| What causes megamorphic call site degradation? | More than 2 different concrete types observed at a virtual call site. C2 cannot inline megamorphic sites, and the JIT falls back to vtable dispatch, losing all inlining-dependent optimizations. |
| What is false sharing? | Two threads writing to different variables that share the same CPU cache line (64 bytes). Each write by one thread invalidates the other thread's cache copy, causing constant cache coherency traffic. Mitigate with @Contended padding or cache-line-aligned allocation. |
| What triggers JIT deoptimization? | Class loading events that invalidate CHA (Class Hierarchy Analysis) assumptions (e.g., a new subclass loaded after inlining), null check failures on speculated non-null refs, type profile changes at guarded inlines, and explicit deoptimization for debugging. |
| Why might -Xss affect thread count? | Each thread requires OS virtual memory for its stack (default 512KB–1MB). With 10,000 threads and 1MB stack size, that's 10GB of virtual address space just for stacks. Reducing -Xss allows more threads but risks StackOverflowError in deeply recursive methods. |
| What is on-stack replacement (OSR)? | JIT compilation of a currently-executing method's loop body, replacing the interpreter frame mid-execution with a compiled frame. Allows hot loops discovered at runtime to be compiled without waiting for the method to complete and be re-invoked. |
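The IntegerCache row above can be verified directly:

```java
public class IntegerCacheDemo {
    public static void main(String[] args) {
        Integer a = Integer.valueOf(100), b = Integer.valueOf(100);
        Integer c = Integer.valueOf(200), d = Integer.valueOf(200);
        System.out.println(a == b);       // true  — both are the cached object
        System.out.println(c == d);       // false — outside the -128..127 cache
        System.out.println(c.equals(d));  // true  — always compare by value
    }
}
```

Note that autoboxing (`Integer x = 100;`) compiles to Integer.valueOf, so the same cache behavior applies to boxed literals.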
Lock Ordering and Deadlock
```java
// Classic deadlock pattern
synchronized (lockA) { synchronized (lockB) { } }   // Thread 1: acquires A, then B
synchronized (lockB) { synchronized (lockA) { } }   // Thread 2: acquires B, then A → DEADLOCK

// Detect with: jstack -l <pid> → look for "Found one Java-level deadlock:"

// Fix: always acquire locks in a consistent global order
if (System.identityHashCode(lockA) < System.identityHashCode(lockB)) {
    synchronized (lockA) { synchronized (lockB) { /* work */ } }
} else {
    synchronized (lockB) { synchronized (lockA) { /* work */ } }
}
```
ThreadLocal Memory Leak
```java
static ThreadLocal<HeavyObject> tl = new ThreadLocal<>();

// In a thread pool worker:
tl.set(new HeavyObject());
doWork();
// FORGOT: tl.remove();
// The thread returns to the pool, and the ThreadLocal entry is retained
// for the thread's lifetime — HeavyObject is never GC'd.

// Fix: ALWAYS use try/finally to call tl.remove()
try {
    tl.set(new HeavyObject());
    doWork();
} finally {
    tl.remove();   // mandatory!
}
```